Missing Data and Imputation

Authors

Javier Estrada

Michael Underwood

Elizabeth Subject-Scott

Published

April 12, 2023


Introduction

Missing Data

Missing data occurs when one or more values in a dataset are not recorded. This can happen for many reasons, intentional or unintentional, and the missingness can be classified into the following three categories, otherwise known as missingness mechanisms (Mainzer et al. 2023):

  • Missing completely at random (MCAR): the probability of a value being missing is independent of both the observed and the unobserved data.

  • Missing at random (MAR): the probability of a value being missing depends only on the observed values.

  • Missing not at random (MNAR): the probability of a value being missing depends on the missing values themselves, possibly in addition to the observed values.

Figure 1: Graphical Representation of Missingness Mechanisms (Schafer and Graham 2002)

(X denotes the completely observed variables, Y the partly missing variables, Z the component of the cause of missingness unrelated to X and Y, and R the missingness.)

Looking for patterns in the missing data can help us determine to which category it belongs. These mechanisms are important in deciding how to handle the missing data. MCAR is the best-case scenario but seldom occurs; MAR and MNAR are more common.
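To make these mechanisms concrete, the following is a minimal sketch in R (the language used throughout this paper) that generates toy data under each mechanism. The variable names and probabilities are illustrative assumptions, not part of any cited study.

Code
# Simulated toy data: x is fully observed (X); y will receive missing values (Y)
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# MCAR: missingness is independent of both x and y
y_mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: missingness depends only on the observed x
y_mar <- ifelse(runif(n) < plogis(x), NA, y)

# MNAR: missingness depends on the unobserved value of y itself
y_mnar <- ifelse(runif(n) < plogis(y), NA, y)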

The problem with ignoring missing values is that the remaining data do not give a true representation of the dataset, which can bias the analysis and reduces its statistical power (van Ginkel et al. 2020). To enhance the quality of research, two practices are recommended: explicitly acknowledge missing data problems and the conditions under which they occur, and employ principled methods to handle the missing data (Dong and Peng 2013).

Methods to Deal with Missing Data

There are three families of methods for dealing with missing data: likelihood and Bayesian methods, weighting methods, and imputation methods (Cao et al. 2021). Missing data can also be handled by simply deleting the affected observations.

  • The likelihood and Bayesian method combines information from a prior distribution with the evidence obtained in a sample to predict a value. It requires technical coding and advanced statistical knowledge.

  • The weighting method is a traditional approach in which weights derived from the available data are used to adjust for non-response in a survey. It becomes inefficient when there are extreme weights or when many weights are needed.

  • The imputation method replaces a missing value with an estimate derived from the available data. There are two types of imputation: single and multiple.

Deleting missing data

Listwise deletion removes the entire observation from the dataset. Deleting missing data can discard important information and is therefore generally not recommended. When the amount of missing data is small and the mechanism is MCAR, listwise deletion can be used: there will usually be little bias, but potentially important information may still be lost.

T-tests and chi-square tests can be used on pairs of predictor variables to determine whether the means of the groups with and without missing values differ significantly. According to van Ginkel et al. (2020), if the test is significant, the null hypothesis is rejected, indicating that the missing values are not randomly scattered throughout the data; this implies that the missing data are MAR or MNAR. Conversely, if nonsignificant, the data cannot be MAR. This does not, however, rule out MNAR: other information about the population is needed to determine that.
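As a minimal sketch of such a check (on a small hypothetical data frame, not the housing data analyzed later):

Code
# Hypothetical data: does mean price differ between rows with and without sqft_living?
df <- data.frame(price       = c(100, 250, 300, 150, 200, 400, 350, 275),
                 sqft_living = c(900, NA, 1500, NA, 1100, 2000, NA, 1300))
miss <- is.na(df$sqft_living)
t.test(df$price[miss], df$price[!miss])  # significant -> missingness related to price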

Whenever missing data are categorized as MAR or MNAR, listwise deletion is wasteful and the analysis biased. Alternative methods of dealing with the missing data are recommended: either pairwise deletion or imputation.

Pairwise deletion removes an observation only from analyses that involve its missing variable. It allows more data to be analyzed than listwise deletion but limits the ability to make inferences about the total sample. For this reason, imputation is recommended to properly deal with missing data. Both forms of deletion are sketched below.
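A toy illustration in R (not the analysis dataset):

Code
df <- data.frame(a = c(1, 2, NA, 4), b = c(NA, 2, 3, 4))

# Listwise deletion: drop every row that contains any NA
na.omit(df)

# Pairwise deletion: use the rows that are complete for this particular pair
cor(df$a, df$b, use = "pairwise.complete.obs")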

Preferred Method to Handle Missing Data

Imputation is the preferred method to handle missing data. It consists of replacing missing data with estimates obtained from the original, available data. After imputation, there is a full dataset to analyze. According to Wulff and Jeppesen (2017), 3 to 5 imputations are sufficient, and 10 imputations are more than enough. More recent research indicates that, to improve statistical power, the number of imputations should be at least equal to the percentage of missing data (5% of data missing calls for 5 imputations, 10% for 10, 20% for 20, and so on) (Pedersen et al. 2017).
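This rule of thumb is easy to encode; the sketch below assumes a hypothetical data frame df with missing values:

Code
df <- data.frame(a = c(1, NA, 3, 4, 5), b = c(2, 2, NA, 4, 5))
pct_missing <- 100 * mean(!complete.cases(df))  # percent of incomplete cases
m <- max(5, ceiling(pct_missing))               # use at least the common default of 5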

Single, or univariate, imputation uses only one estimate to replace each missing value. Methods of single imputation include mean imputation, last observation carried forward, and random imputation. Each is briefly explained below and sketched in code after the list:

  • Mean imputation is a straightforward process: the mean of the observed values for a variable is calculated and substituted for each of its missing values. The problem with this method is that it reduces the variance, which leads to confidence intervals that are too narrow.

  • Last Observation Carried Forward (LOCF) is a technique for longitudinal studies in which a missing value is replaced by the most recent previously observed value (Streiner 2008). The problem with this method is that it assumes the previous observed value remains unchanged, which is usually not the case.

  • Random imputation randomly draws an observed value and uses it to replace a missing value. The problem with this method is that it introduces additional variability.
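The following is a minimal sketch of all three methods on a toy vector (illustrative values only):

Code
v <- c(3, NA, 5, 7, NA, 4)

# Mean imputation: replace NAs with the mean of the observed values
v_mean <- ifelse(is.na(v), mean(v, na.rm = TRUE), v)

# LOCF: carry the most recent observed value forward
v_locf <- v
for (i in seq_along(v_locf)) {
  if (is.na(v_locf[i]) && i > 1) v_locf[i] <- v_locf[i - 1]
}

# Random imputation: draw replacements at random from the observed values
set.seed(1)
v_rand <- v
v_rand[is.na(v_rand)] <- sample(v[!is.na(v)], sum(is.na(v)), replace = TRUE)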

These single imputation methods are flawed. They often result in underestimated standard errors or p-values that are too small (Dong and Peng 2013), which can bias the analysis. Multiple imputation is the better method because it handles missing data more faithfully and provides less biased results.

Multiple, or multivariate, imputation replaces the missing data with several estimates by creating multiple completed versions of the original dataset. It can be done using a regression model, or a sequence of regression models, such as linear, logistic, and Poisson. M plausible values are generated for each unobserved data point, resulting in M complete datasets (Dong and Peng 2013). The new values are randomly drawn from predictive distributions, either through joint modeling (JM, which is rarely used anymore) or fully conditional specification (FCS) (Wongkamthong and Akande 2023). Each dataset is then analyzed, and the results are combined into a single set of estimates.

The purpose of multiple imputation is to create a pool of imputed data for analysis, but if the pooled results are lacking, then multiple imputation should not be done (Mainzer et al. 2023). Another reason not to use multiple imputation is when there are very few missing values, in which case there may be no benefit. Also worth noting: some statistical analysis software already has built-in features to deal with missing data.

Multiple imputation by chained equations, otherwise known as MICE, is the most common and preferred method of multiple imputation (Wulff and Jeppesen 2017). It provides a more reliable way to analyze data with missing values. For this reason, this paper focuses on the methodology and application of the MICE process.

Code
#loading packages
library(DiagrammeR)

Figure 2: Flowchart of the MICE-process based on procedures proposed by Rubin (Wulff and Jeppesen 2017)

Code
DiagrammeR::grViz("digraph {

# initiate graph
graph [layout = dot, rankdir = LR, label = 'The MICE-Process\n\n',labelloc = t, fontcolor = DarkSlateBlue, fontsize = 45]

# global node settings
node [shape = rectangle, style = filled, fillcolor = AliceBlue, fontcolor = DarkSlateBlue, fontsize = 35]
bgcolor = none

# label nodes
incomplete [label =  'Incomplete data set']
imputed1 [label = 'Imputed \n data set 1']
estimates1 [label = 'Estimates from \n analysis 1']
rubin [label = 'Rubin rules', shape = diamond]
combined [label = 'Combined results']
imputed2 [label = 'Imputed \n data set 2']
estimates2 [label = 'Estimates from \n analysis 2']
imputedm [label = 'Imputed \n data set m']
estimatesm [label = 'Estimates from \n analysis m']


# edge definitions with the node IDs
incomplete -> imputed1 [arrowhead = vee, color = DarkSlateBlue]
imputed1 -> estimates1 [arrowhead = vee, color = DarkSlateBlue]
estimates1 -> rubin [arrowhead = vee, color = DarkSlateBlue]
incomplete -> imputed2 [arrowhead = vee, color = DarkSlateBlue]
imputed2 -> estimates2 [arrowhead = vee, color = DarkSlateBlue]
estimates2-> rubin [arrowhead = vee, color = DarkSlateBlue]
incomplete -> imputedm [arrowhead = vee, color = DarkSlateBlue]
imputedm -> estimatesm [arrowhead = vee, color = DarkSlateBlue]
estimatesm -> rubin [arrowhead = vee, color = DarkSlateBlue]
rubin -> combined [arrowhead = vee, color = DarkSlateBlue]
}")

*Rubin’s Rules: average the estimates across the m imputations; calculate the within-imputation and between-imputation variances of the m estimates; combine them using the adjustment term (1 + 1/m).

Other Methods of Imputation

There are other methods of imputation worth noting; each is briefly described below.

Regression Imputation is based on a linear regression model. Missing data are randomly drawn from a conditional distribution when the variables are continuous, and from a logistic regression model when they are categorical (van Ginkel et al. 2020).

Predictive Mean Matching is also based on a linear regression model. The approach is the same as regression imputation for categorical missing values but differs for continuous variables: instead of random draws from a conditional distribution, missing values are based on the predicted values of the outcome variable (van Ginkel et al. 2020).

Hot Deck (HD) imputation is when a missing value is replaced by an observed response of a similar unit, also known as the donor. It can be either random or deterministic, which is based on a metric or value (Thongsri and Samart 2022). It does not rely on model fitting.

Stochastic Regression (SR) Imputation is an extension of regression imputation. The process is the same but a residual term from the normal distribution of the regression of the predictor outcome is added to the imputed value (Thongsri and Samart 2022). This maintains the variability of the data.

Random Forest (RF) Imputation is based on machine learning algorithms. Missing values are first replaced with the mean or mode of that particular variable and then the dataset is split into a training set and a prediction set (Thongsri and Samart 2022). The missing values are then replaced with predictions from these sets. This type of imputation can be used on continuous or categorical variables with complex interactions.
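Several of these methods are available in the mice package through its method argument. The sketch below assumes a hypothetical data frame df with missing values (and, for "rf", that the randomForest package is installed):

Code
library(mice)
df <- data.frame(a = c(1, NA, 3, 4, 5, 2), b = c(2, 2, NA, 4, 5, 3))

imp_pmm <- mice(df, method = "pmm",      m = 5, printFlag = FALSE)  # predictive mean matching
imp_sr  <- mice(df, method = "norm.nob", m = 5, printFlag = FALSE)  # stochastic regression
imp_rf  <- mice(df, method = "rf",       m = 5, printFlag = FALSE)  # random forest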

Methodology

Multiple Imputation by Chained Equations (MICE)

According to Rubin’s rules, multiple imputation creates an imputed value for each missing data point in each of M imputed datasets, resulting in M complete datasets. For each of the M datasets, an estimate of \(\theta\) is acquired.

The combined estimator of \(\theta\) is given by:

\[
\hat{\theta}_{M} = \frac{1}{M}\sum_{m = 1}^{M} \hat{\theta}_{m}
\]

The proposed variance estimator of \({\hat{\theta}}_{M}\) is given by:

\[
\hat{\Phi}_{M} = \bar{\phi}_{M} + \left(1 + \frac{1}{M}\right) B_{M},
\]

where \(\bar{\phi}_{M} = \displaystyle\frac{1}{M}\sum_{m = 1}^{M} \hat{\phi}_{m}\) is the average within-imputation variance,

and \(B_{M} = \displaystyle\frac{1}{M - 1}\sum_{m = 1}^{M} \left(\hat{\theta}_{m} - \bar{\theta}_{M}\right)^{2}\) is the between-imputation variance.

(Arnab 2017)
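These pooling rules translate directly into a few lines of R; the estimates below are hypothetical numbers used only to illustrate the arithmetic:

Code
# Hypothetical per-imputation estimates and variances from M = 5 imputed datasets
theta_hat <- c(1.02, 0.98, 1.05, 1.01, 0.97)
phi_hat   <- c(0.040, 0.042, 0.039, 0.041, 0.043)
M <- length(theta_hat)

theta_M <- mean(theta_hat)                        # combined estimate
phi_bar <- mean(phi_hat)                          # average within-imputation variance
B_M     <- sum((theta_hat - theta_M)^2) / (M - 1) # between-imputation variance
Phi_M   <- phi_bar + (1 + 1/M) * B_M              # total variance of the combined estimate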

Assumptions

All multiple imputation methods have the following assumptions:

  • Observed data follow a normal distribution.

  • Missing data are classified as MAR, i.e., the probability that a value is missing depends only on observed values, not on unobserved values.

These assumptions are essential; otherwise, we would not be able to predict plausible replacement values for the missing data.

Steps:

The chained equation process has the following steps (Azur et al. 2011); a minimal sketch in R follows the list:

Step 1: Using simple imputation (for example, mean imputation), replace every missing value; these values are referred to as “place holders”.

Step 2: The “place holder” values for one variable are set back to missing.

Step 3: The observed values of this variable (the dependent variable) are regressed on the other variables (the independent variables) in the model, under the same assumptions as ordinary linear, logistic, or Poisson regression.

Step 4: The missing values for this variable are replaced with predictions from the newly fitted model.

Step 5: Repeat Steps 2-4 for each variable that has missing values, until all missing values have been replaced once; this constitutes one cycle.

Step 6: Repeat Steps 2-5, updating the imputations each cycle, for as many cycles as required; the whole procedure is then repeated m times to produce m imputed datasets.
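A minimal sketch of one such cycle on a hypothetical two-variable data frame follows; mice() automates this and also adds random draws to the predictions, which are omitted here for brevity:

Code
d  <- data.frame(x = c(1, 2, NA, 4, 5), y = c(2, NA, 6, 8, 10))
mx <- is.na(d$x); my <- is.na(d$y)

# Step 1: placeholder imputation with the mean
d$x[mx] <- mean(d$x, na.rm = TRUE)
d$y[my] <- mean(d$y, na.rm = TRUE)

# Steps 2-4 for x: treat its placeholders as missing again, regress x on y
# using the rows where x was observed, then predict the missing x values
fit_x   <- lm(x ~ y, data = d[!mx, ])
d$x[mx] <- predict(fit_x, newdata = d[mx, , drop = FALSE])

# Steps 2-4 for y, analogously
fit_y   <- lm(y ~ x, data = d[!my, ])
d$y[my] <- predict(fit_y, newdata = d[my, , drop = FALSE])

# Step 6: iterate these updates for several cycles until the values stabilize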

Analysis and Results

Data and Visualizations

Load Data and Packages
Code
# load libraries
library(gtsummary)
library(gt)
library(ggplot2)
library(dplyr, warn.conflicts=FALSE)
library(mice, warn.conflicts=FALSE)

# load data
original = read.csv("kc_house_data.csv")
Description of Dataset

kc_house_data.csv

Details of Dataset

This dataset contains the sale prices of 21,613 houses sold in King County (the Seattle area), Washington, between May 2014 and May 2015. The original dataset contained 21 columns with various selling attributes. For the purposes of this project, we have condensed the variables to four: sale price, number of bedrooms, number of bathrooms, and square feet of living space.

Definition of Data in Dataset
Variable      Type     Description
price         numeric  Sale price of the home.
bedrooms      integer  Number of bedrooms.
bathrooms     numeric  Number of bathrooms.
sqft_living   integer  Square feet of living space.
Summary of Dataset:
Code
original %>% 
  tbl_summary(missing_text = "NA") %>%
  add_n() %>%
  modify_header(label ~ "**Variable**") %>%
  modify_caption("**Summary of Data**") %>%
  bold_labels() %>%
  as_gt() %>% 
  tab_options(column_labels.background.color = "#FFFFFF00",
    table.background.color = "#FFFFFF00")
Summary of Data
Variable      N        N = 21,613 (1)
price         21,613   450,000 (321,950, 645,000)
bedrooms      21,613   3 (3, 4)
bathrooms     21,613   2.25 (1.75, 2.50)
sqft_living   21,613   1,910 (1,427, 2,550)
(1) Median (IQR)
Plotting of Data
Code
ggplot(original, aes(x=sqft_living, y=price)) +
  geom_point(alpha=0.5, color='darkblue') +
  labs(x="sqft_living", y="price")

MICE in R

Using the mice (Multivariate Imputation by Chained Equations) package in R, a statistical programming language, we will create multiple datasets with imputed values for the missing data. We will delete some data points from our dataset to create missing values, run through the MICE process, and present the results of the imputations. To check the accuracy of the imputations, we will then compare these results to the original, complete dataset.

Evaluate Dataset

First, we evaluate the dataset for missing values by using the is.na() function:

Code
# check for missing data
sapply(original, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
          0           0           0           0 

Create Missing Data

There is no missing data, so we will artificially create some by duplicating the dataset and randomly removing some values.

Code
# duplicate dataset
house <- original

# remove some values
set.seed(10)
house[sample(1:nrow(house), 200), "sqft_living"] <- NA
house[sample(1:nrow(house), 200), "bedrooms"] <- NA
house[sample(1:nrow(house), 200), "bathrooms"] <- NA
house[sample(1:nrow(house), 100), "price"] <- NA

We need to confirm the presence of missing data:

Code
sapply(house, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
        100         200         200         200 

Our dataset now has 700 NA/missing values, which works out to about 3% of the 21,613 observations (700/21,613). If we were to simply delete these data points, we would be discarding valuable information.

Checking for Patterns

Next, we check the missingness by looking for patterns in the dataset using the md.pattern() function from the mice package:

Code
# VIM is loaded here for the aggr() and marginplot() functions used below
library(VIM)
md.pattern(house, rotate.names = TRUE)

      price bedrooms bathrooms sqft_living    
20920     1        1         1           1   0
195       1        1         1           0   1
196       1        1         0           1   1
2         1        1         0           0   2
197       1        0         1           1   1
2         1        0         1           0   2
1         1        0         0           1   2
98        0        1         1           1   1
1         0        1         1           0   2
1         0        1         0           1   2
        100      200       200         200 700

In the plot, blue marks the observed values and red the missing values. There are 20,920 observations with no missing data. The column totals at the bottom are squeezed together and hard to read, but we know from the previous output that they are the number of missing values per variable: price = 100, bedrooms = 200, bathrooms = 200, sqft_living = 200, for a total of 700.

Code
aggr_plot <- aggr(house, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE,
                  labels = names(house), cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))


 Variables sorted by number of missings: 
    Variable       Count
    bedrooms 0.009253690
   bathrooms 0.009253690
 sqft_living 0.009253690
       price 0.004626845

This is just another way to look for patterns in the missing data.

Code
marginplot(house[c(1,4)])

The red box plot on the left shows the distribution of sqft_living for observations with price missing; the blue box plot shows the distribution of the remaining data points. The same holds for the price box plots at the bottom. If our MCAR assumption is true, we expect the red and blue box plots to be very similar. In this case, they are.

Imputation Process

Since we are missing about 3% of our data, five imputations are sufficient. This can be done using the mice() function; five imputations is the default, so to use a different number you must set the m argument. We set the seed argument to 1337 (any number can be used) so that the same results are obtained each time the multiple imputation is performed.

Code
# m = 5 imputations (the default); seed makes the results reproducible
imp <- mice(data = house, m = 5, seed = 1337)

 iter imp variable
  1   1  price  bedrooms  bathrooms  sqft_living
  1   2  price  bedrooms  bathrooms  sqft_living
  1   3  price  bedrooms  bathrooms  sqft_living
  1   4  price  bedrooms  bathrooms  sqft_living
  1   5  price  bedrooms  bathrooms  sqft_living
  2   1  price  bedrooms  bathrooms  sqft_living
  2   2  price  bedrooms  bathrooms  sqft_living
  2   3  price  bedrooms  bathrooms  sqft_living
  2   4  price  bedrooms  bathrooms  sqft_living
  2   5  price  bedrooms  bathrooms  sqft_living
  3   1  price  bedrooms  bathrooms  sqft_living
  3   2  price  bedrooms  bathrooms  sqft_living
  3   3  price  bedrooms  bathrooms  sqft_living
  3   4  price  bedrooms  bathrooms  sqft_living
  3   5  price  bedrooms  bathrooms  sqft_living
  4   1  price  bedrooms  bathrooms  sqft_living
  4   2  price  bedrooms  bathrooms  sqft_living
  4   3  price  bedrooms  bathrooms  sqft_living
  4   4  price  bedrooms  bathrooms  sqft_living
  4   5  price  bedrooms  bathrooms  sqft_living
  5   1  price  bedrooms  bathrooms  sqft_living
  5   2  price  bedrooms  bathrooms  sqft_living
  5   3  price  bedrooms  bathrooms  sqft_living
  5   4  price  bedrooms  bathrooms  sqft_living
  5   5  price  bedrooms  bathrooms  sqft_living

We then extract a completed dataset; by default, complete() returns the first imputed dataset:

Code
imputed <- complete(imp)

We need to check and make sure there is no missing data:

Code
sapply(imputed, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
          0           0           0           0 

We can check the quality of the imputations by running a strip plot, a single-axis scatter plot that shows the distribution of each variable per imputed dataset. We want the imputed values to be values that could plausibly have been observed had the data not been missing.

Code
# stripplot() is lattice-based, so par(mfrow) does not apply; we plot each variable separately
stripplot(imp, price, pch = 19, cex = 1, xlab = "Imputation number")

Code
stripplot(imp, bedrooms, pch = 19, cex=1, xlab = "Imputation number")

Code
stripplot(imp, bathrooms, pch = 19, cex=1, xlab = "Imputation number")

Code
stripplot(imp, sqft_living, pch = 19, cex=1, xlab = "Imputation number")

Checking for Accuracy

We will now compare this dataset to our original dataset to check the accuracy of multiple imputation by chained equations:

Code
# price
actual <- original$price[is.na(house$price)]
predicted <- imputed$price[is.na(house$price)]
mean(actual)
[1] 512963.6
Code
mean(predicted)
[1] 515023.7
Code
# bedrooms
actual <- original$bedrooms[is.na(house$bedrooms)] 
predicted <- imputed$bedrooms[is.na(house$bedrooms)] 
table(actual)
actual
  0   2   3   4   5   6 
  1  23 108  57   9   2 
Code
table(predicted)
predicted
 0  1  2  3  4  5  6  7 
 1  2 24 88 65 16  3  1 
Code
# bathrooms
actual <- original$bathrooms[is.na(house$bathrooms)] 
predicted <- imputed$bathrooms[is.na(house$bathrooms)] 
table(actual)
actual
   1  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75    4 
  40   15   34   12   23   48    9    4    6    6    2    1 
Code
table(predicted)
predicted
0.75    1  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75    4 
   2   38   12   25   13   25   52   12    5    5    9    1    1 
Code
# sqft_living
actual <- original$sqft_living[is.na(house$sqft_living)] 
predicted <- imputed$sqft_living[is.na(house$sqft_living)] 
table(actual)
actual
 690  800  840  860  880  900  910  920  930  950  970  980  998 1000 1010 1060 
   1    1    1    1    1    1    1    1    1    2    1    1    1    2    1    1 
1110 1120 1140 1160 1200 1250 1260 1270 1280 1310 1330 1340 1350 1360 1370 1380 
   1    1    1    1    4    2    1    2    1    1    1    2    2    3    1    1 
1400 1410 1420 1440 1451 1470 1480 1490 1510 1540 1560 1570 1580 1590 1620 1640 
   1    1    1    2    1    2    3    2    3    2    1    1    2    2    1    2 
1680 1690 1710 1720 1740 1770 1780 1785 1790 1820 1830 1840 1850 1860 1870 1890 
   1    2    1    2    1    1    4    1    2    1    1    3    2    1    2    1 
1900 1920 1930 1950 1982 1990 2000 2010 2040 2050 2060 2070 2090 2100 2120 2140 
   3    2    3    2    1    1    1    1    1    2    1    1    1    3    1    1 
2150 2170 2180 2190 2200 2210 2220 2230 2270 2290 2320 2330 2340 2360 2370 2420 
   1    1    3    3    3    2    1    1    2    2    1    1    1    2    2    2 
2450 2460 2470 2520 2540 2550 2560 2570 2580 2600 2640 2660 2740 2760 2800 2820 
   1    1    1    1    1    4    1    2    1    1    1    2    2    1    1    1 
2840 2860 2900 2930 2963 2970 3180 3220 3230 3240 3280 3361 3430 3450 3510 3550 
   1    1    1    1    1    1    1    1    1    1    2    1    2    1    1    2 
3560 3670 3870 3890 4130 4320 4360 4700 4750 5470 
   1    1    1    1    1    1    1    1    1    1 
Code
table(predicted)
predicted
 600  730  770  790  840  850  860  880  890  970 1030 1060 1070 1120 1130 1140 
   1    1    1    1    1    1    1    2    1    1    1    1    3    1    1    1 
1150 1160 1190 1200 1210 1240 1250 1270 1290 1300 1310 1320 1330 1340 1370 1380 
   1    1    1    2    1    2    1    1    1    1    1    1    1    1    1    1 
1400 1410 1430 1440 1470 1480 1500 1510 1530 1550 1570 1580 1590 1600 1610 1620 
   1    2    1    2    2    1    1    1    2    1    2    1    1    2    2    1 
1640 1650 1658 1660 1670 1700 1720 1730 1740 1760 1770 1780 1790 1800 1820 1830 
   1    2    1    4    1    1    1    1    1    1    1    2    3    1    1    4 
1840 1860 1870 1880 1890 1900 1920 1930 1940 1950 1960 1970 1980 2000 2007 2014 
   1    1    1    2    1    2    2    1    3    1    2    1    2    1    1    1 
2030 2040 2050 2056 2060 2070 2080 2090 2100 2110 2120 2140 2150 2153 2170 2200 
   1    1    2    1    1    1    1    1    1    2    1    1    2    1    1    1 
2210 2230 2242 2257 2260 2270 2320 2370 2380 2400 2410 2440 2450 2470 2480 2490 
   1    1    1    1    1    1    1    1    1    2    1    1    1    1    2    2 
2500 2530 2550 2570 2580 2590 2600 2630 2650 2658 2660 2670 2680 2692 2708 2710 
   1    1    4    1    1    1    1    2    1    1    1    2    3    1    1    1 
2730 2790 2810 2850 2860 2930 2970 2990 3060 3150 3260 3300 3310 3320 3420 3480 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
3510 3620 3800 3850 3870 3910 3960 4070 4200 4440 4650 4800 
   1    1    1    1    1    1    1    3    1    1    1    1 

The mean of the predicted sale prices is very close to the mean of the actual sale prices (a difference of about $2,000), and the distributions of the other variables are very close as well, which indicates high imputation accuracy.

Pooling and Fitting Imputed Dataset

Next, we pool the results of the analyses run on each imputed dataset to arrive at estimates that properly account for the missing data. We fit the complete-data model with the with() function and display the summary of the pooled results, which gives the estimate, standard error, test statistic, degrees of freedom, and p-value for each variable.

Code
# fit complete-data model
fit <- with(imp, lm(price~bedrooms+bathrooms+sqft_living))

# pool and summarize the results
summary(pool(fit))
         term    estimate   std.error  statistic        df       p.value
1 (Intercept)  75043.7775 6952.563287  10.793685 14431.278  4.677205e-27
2    bedrooms -57930.4966 2357.322483 -24.574702  7683.745 1.936143e-128
3   bathrooms   7624.2182 3531.505441   2.158914 11516.540  3.087739e-02
4 sqft_living    309.7992    3.114475  99.470779  8431.335  0.000000e+00

We now have pooled estimates based on complete data that we can analyze! Next we compare them to a model fit on the original, fully observed dataset to see how accurate our imputation method was.

Code
og_lm = lm(price~bedrooms+bathrooms+sqft_living, data = original)
summary(og_lm)

Call:
lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = original)

Residuals:
     Min       1Q   Median       3Q      Max 
-1644794  -144361   -22891   102420  4178611 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  74662.099   6917.977  10.792   <2e-16 ***
bedrooms    -57906.631   2336.062 -24.788   <2e-16 ***
bathrooms     7928.708   3512.744   2.257    0.024 *  
sqft_living    309.605      3.089 100.238   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 258000 on 21609 degrees of freedom
Multiple R-squared:  0.5069,    Adjusted R-squared:  0.5069 
F-statistic:  7406 on 3 and 21609 DF,  p-value: < 2.2e-16

The results are very similar, with only minimal differences. We can conclude that multiple imputation by chained equations is a reliable method for imputing missing data.

References

Arnab, R. 2017. Survey Sampling Theory and Applications. Academic Press. https://www.sciencedirect.com/topics/mathematics/imputation-method.
Azur, M. J., E. A. Stuart, C. Frangakis, and P. J. Leaf. 2011. “Multiple Imputation by Chained Equations: What Is It and How Does It Work?” Int J Methods Psychiatr Res. 20 (1): 40–49. https://onlinelibrary.wiley.com/doi/epdf/10.1002/mpr.329.
Cao, Y., H. Allore, B. Van Wyk, and R. Gutman. 2021. “Review and Evaluation of Imputation Methods for Multivariate Longitudinal Data with Mixed-Type Incomplete Variables.” Statistics in Medicine 41 (30): 5844–76. https://doi-org.ezproxy.lib.uwf.edu/10.1002/sim.9592.
Dong, Y., and C. J. Peng. 2013. “Principled Missing Data Methods for Researchers.” SpringerPlus 2 (222). https://doi.org/10.1186/2193-1801-2-222.
Mainzer, R., M. Moreno-Betancur, C. Nguyen, J. Simpson, J. Carlin, and K. Lee. 2023. “Handling of Missing Data with Multiple Imputation in Observational Studies That Address Causal Questions: Protocol for a Scoping Review.” BMJ Open 13: 1–6. http://dx.doi.org/10.1136/bmjopen-2022-065576.
Pedersen, A. B., E. M. Mikkelsen, D. Cronin-Fenton, N. R. Kristensen, T. M. Pham, L. Pedersen, and I. Petersen. 2017. “Missing Data and Multiple Imputation in Clinical Epidemiological Research.” Clinical Epidemiology 9: 157–66. https://www.tandfonline.com/doi/full/10.2147/CLEP.S129785.
Schafer, J. L., and J. W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7 (2): 147–77. https://psycnet.apa.org/doi/10.1037/1082-989X.7.2.147.
Streiner, D. L. 2008. “Missing Data and the Trouble with LOCF.” EBMH 11 (1): 1–5. http://dx.doi.org/10.1136/ebmh.11.1.3-a.
Thongsri, T., and K. Samart. 2022. “Composite Imputation Method for the Multiple Linear Regression with Missing at Random Data.” International Journal of Mathematics and Computer Science 17 (1): 51–62. http://ijmcs.future-in-tech.net/17.1/R-Samart.pdf.
van Ginkel, J. R., M. Linting, R. C. Rippe, and A. van der Voort. 2020. “Rebutting Existing Misconceptions about Multiple Imputation as a Method for Handling Missing Data.” Journal of Personality Assessment 102 (3): 2812–31. https://doi.org/10.1080/00223891.2018.1530680.
Wongkamthong, C., and O. Akande. 2023. “A Comparative Study of Imputation Methods for Multivariate Ordinal Data.” Journal of Survey Statistics and Methodology 11 (1): 189–212. https://doi.org/10.1093/jssam/smab028.
Wulff, J. N., and L. E. Jeppesen. 2017. “Multiple Imputation by Chained Equations in Praxis: Guidelines and Review.” Electronic Journal of Business Research Methods 15 (1): 41–56. https://vbn.aau.dk/ws/files/257318283/ejbrm_volume15_issue1_article450.pdf.